A novel encoder-decoder model based on Autoformer for air quality index prediction

Rapid economic development has led to increasingly serious air quality problems. Accurate air quality prediction can provide technical support for air pollution prevention and treatment. In this paper, we propose a novel encoder-decoder model named Enhanced Autoformer (EnAutoformer) to improve air quality index (AQI) prediction. In this model, (a) an enhanced cross-correlation (ECC) is proposed for extracting the temporal dependencies in AQI time series; (b) by combining the ECC with the cross-stage feature fusion mechanism of CSPDenseNet, the core module CSP_ECC is proposed for improving the computational efficiency of the EnAutoformer; and (c) time series decomposition and dilated causal convolution are added to the decoder to extract finer-grained features from the original AQI data and improve the performance of the proposed model for long-term prediction. Real-world air quality datasets collected from Lanzhou are used to validate the performance of our prediction model. The experimental results show that the EnAutoformer model improves prediction accuracy over the baselines and is a promising alternative for complex air quality prediction.


Introduction
With continued economic development, the environmental system on which human beings depend for survival is increasingly challenged by environmental pollution [1]. Air pollution has become one of the biggest threats to human health and life safety. The air quality index (AQI) is an important metric for the quantitative evaluation of air quality conditions. It is calculated based on the China Air Quality Standard (GB3095-2012) [2] from the six pollutants in the unified evaluation standard: PM2.5, PM10, CO, O3, SO2 and NO2. According to the Technical Regulation on Ambient Air Quality Index (on trial) (HJ 633-2012) issued by the Ministry of Environmental Protection of the People's Republic of China, the AQI is divided into six levels [3]. The classification standards and ranges are shown in Table 1. A good environment is the basis of human survival and health, and various diseases have been proven to be closely related to environmental pollution. Therefore, accurate AQI prediction is important for the early warning and management of atmospheric ecology [4].
In recent years, several methods have been proposed to solve the AQI prediction problem. The existing prediction methods are broadly categorized into three types: traditional time series, traditional machine learning and deep learning models. The prediction models based on traditional time series methods mainly include nonlinear autoregressive (NAR), autoregressive moving average (ARMA), nonlinear autoregressive moving average (NARMA) and autoregressive integrated moving average (ARIMA) models. Carlos et al. [5] applied the ARIMA model to analyze the PM10 concentration in a high-altitude megacity by evaluating the impact of land surface cover on PM10 and achieved good performance. An ARIMA model was also employed to predict the air quality in New Delhi, India; the results showed that ARIMA can capture the non-stationarity of air quality and obtain satisfactory results [6]. Erdinc et al. [7] decomposed the PM10 series into three levels using the maximal overlap discrete wavelet transformation (MODWT) and applied an ARIMA model to each resulting subseries. Bhatti et al. [8] analyzed particle mass concentrations through correlations between air pollutants and constructed a seasonal ARIMA (SARIMA) model to predict future PM2.5. Alyousifi et al. [9] determined the transition probability matrix of a Markov chain model by the maximum a posteriori method. This study provided an important reference for the scientific prevention and control of air pollution. Compared with traditional statistical methods, machine learning does not need to make assumptions about the data. Meanwhile, it can achieve accurate prediction results by using cross-validation methods. The traditional machine learning models include logistic regression (LR), decision tree (DT), support vector regression (SVR), random forest (RF), Naive Bayes (NB) and K-Nearest Neighbors (KNN) (Pugliese et al. [10]). Liu et al. [11] used the support vector machine (SVM) model optimized by different algorithms to predict PM2.5 levels and achieved good prediction accuracy. However, when the training sample is large, storing and computing the matrix is a major challenge for SVM algorithms. Xia [12] used RF and cluster analysis methods to investigate the air quality distribution of Changsha and further used an ARMA model for prediction. Choubin et al. [13] used multiple machine learning models, including bagged CART, mixture discriminant analysis and random forest, to predict the hazard of particulate matter (PM). Liu et al. [14] proposed a fusion model, PCR-SVR-ARMA, that incorporates principal component regression (PCR), SVR and ARMA to predict air pollutants. Rajat et al. [15] employed four supervised machine learning methods (DT, RF, NB and KNN) for the prediction of AQI. The results showed that DT gave the best performance among all the models. Ma et al. [16] used a gradient boosting algorithm to predict PM2.5 in the Jing-Jin-Ji area. Their results showed that the model could more accurately predict the next day's PM2.5 based on the data of the previous 5 days. However, when the dataset is large, the algorithm consumes a lot of computing time. Ke et al. [17] developed an air quality prediction system based on machine learning for predicting six common pollutants and pollution levels. Seven datasets collected from typical central cities in China were used. Experimental results show that the proposed model can achieve reliable short-term air quality prediction. Traditional machine learning methods focus on short-term prediction and can achieve good prediction accuracy.
However, traditional machine learning models have simple architectures and limited parameters and cannot tap into the deeper, implicit spatio-temporal correlations in big data, so they have limited capability for medium- and long-term prediction. In recent years, deep learning (DL) has developed rapidly and has become a leading trend in scientific research. Deep learning models include multi-layer perceptrons (MLP), convolutional neural networks (CNN), long short-term memory (LSTM) networks and gated recurrent units (GRU). Compared to traditional machine learning, deep learning methods use deep neural networks to perform more sophisticated processing, resulting in a more powerful feature mining capability. Deep learning has been applied to the fields of meteorology and environmental science. The conditional local convolution recurrent network (CLCRN) [18] was employed for modeling the meteorological flows of local patterns on the whole sphere. Four hour-wise weather datasets including temperature, cloud cover, humidity and surface wind component were used for performance evaluation. Lv et al. [19,20] employed deep learning for wind speed prediction, proposing hybrid deep learning models that combine feature selection, time series decomposition and multi-objective parameter optimization. A location-refining neural network that combines optical flow-based methods with deep learning-based methods was proposed for heavy rainfall prediction [21]. An LSTM-based prediction model was employed to estimate sea surface temperatures and predict high water temperature [22].
Air quality prediction involves a variety of factors, including pollutant concentrations and meteorology. In particular, changes in meteorological conditions can lead to large fluctuations in pollutant concentrations, which makes prediction more difficult. Deep learning models can capture these complex features of air quality. Agarwal et al. [23] used Artificial Neural Networks (ANN) to predict pollutant concentrations (PM10, PM2.5, NO2, O3) with data collected from Delhi. The model dynamically adjusts its predictions with real-time corrections to improve forecast quality. Zhou et al. [24] constructed a deep multi-output LSTM (DM-LSTM) model and predicted the concentration of relevant pollutants in Taipei, Taiwan, which significantly improved the accuracy and stability of air quality forecasting. Aggarwal et al. [25] proposed a hybrid model (P-LSTM) based on LSTM and particle swarm optimization (PSO) to predict air quality with data collected from 15 locations in India. Experimental results show that PSO can optimize LSTM network parameters and improve prediction performance. Yan et al. [26] constructed multiple AQI models to predict future data by learning the change regularity of air quality data. The comparison found that LSTM has the best performance. Liu et al. [27] proposed an attention-based air quality predictor (AAQP) to forecast the future air quality index of Beijing. Dun et al. [28] proposed a DGC-MTCN model, which combines a dynamic graph convolutional network (DGC) and a multi-channel temporal convolutional network (MTCN), to predict PM2.5 in Beijing and Fushun and achieved good prediction accuracy.
In 2017, the Google team proposed a sequence-to-sequence model with attention mechanisms, the Transformer [29], for machine translation tasks, which replaced the recursive transmission of sequence information with processing the sequence as a whole. In 2019, researchers took full advantage of the Transformer and improved the calculation of attention to accommodate time series data [30]. In recent years, transformer-based models such as Sparse Transformer [31], Reformer [32], Informer [33] and Autoformer [34] have achieved excellent results in capturing long-range dependencies. Various types of transformer-based models are being applied to time series prediction [31-35]. Taking advantage of hybrid deep learning techniques, this study proposes an AQI prediction approach based on the combination of enhanced feature extraction, a cross-stage feature fusion mechanism, a data decomposition method and a deep learning model. The main contributions are listed as follows:
• We propose a novel encoder-decoder model named Enhanced Autoformer (EnAutoformer), which is an improvement of Autoformer, to predict the AQI. The EnAutoformer model consists of three major modules: a feature extraction and fusion module (CSP_ECC), a data decomposition module, and a dilated causal convolution module.
• An enhanced cross-correlation (ECC) is proposed for extracting the temporal-dependent features in AQI time series.
• A CSP_ECC mechanism is designed by integrating the cross-stage feature fusion mechanism of CSPDenseNet and ECC mechanism. CSP_ECC is not only able to extract the temporal dependencies in original time series, but also to improve the computational efficiency.
• To further obtain the finer-grained information, the series decomposition is developed to concurrently extract the frequency-domain features including seasonality and trend from original time series.
• A dilated causal convolution network is employed to capture long-range dependencies of the original time series, further enhancing the long-term predictive ability of the EnAutoformer model.
• To evaluate the effectiveness of EnAutoformer, experiments are conducted on five real-world air quality datasets collected from different regions of Lanzhou. Experimental results show that the proposed EnAutoformer achieves the best predictive performance among the compared baselines.
The organization of this paper is as follows. Section 2 introduces the methodology and proposes the prediction model. Section 3 describes datasets, baseline models, the experimental settings, and discusses the results of the experiments. The conclusions are given in Section 4.

Enhanced cross-correlation
Enhanced cross-correlation (ECC) consists of two core modules: a cross-correlation module that detects time-shifted correlations between time series, and a time-delayed aggregation module that aggregates the strongly correlated ones. The structure of the ECC is shown in Fig 1. Cross-correlation is often used to measure the similarity between a time series $x(t)$ and shifted (lagged) copies of a time series $y(t)$ as a function of the lag $\tau$. The lag at which the cross-correlation reaches its maximum is the lag at which the two time series are best correlated. The cross-correlation function $R_{xy}(\tau)$ at lag $\tau$ accumulates the products of $x(t)$ and the copy of $y(t)$ shifted by $\tau$. The fast Fourier transform (FFT) provides an indirect method to calculate the cross-correlation function; the calculation process is shown in the blue block in Fig 1. The FFT maps the discrete signal $x(t)$ to the frequency domain, and the inverse FFT maps it back. Based on the convolution theorem, the cross-correlation function can be computed with the FFT algorithm as $R_{xy}(\tau) = \mathrm{iFFT}(\mathrm{FFT}_x \cdot \mathrm{FFT}_y^{*})$, where $\mathrm{FFT}_x$ and $\mathrm{FFT}_y$ are the Fourier transforms of $x(t)$ and $y(t)$, respectively, $*$ denotes complex conjugation and $\mathrm{iFFT}(\cdot)$ stands for the inverse FFT. Compared with the direct calculation method, the indirect method reduces the time complexity of cross-correlation from $O(N^2)$ to $O(N \log N)$, so it is clearly advantageous for large samples.
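To illustrate the two computation routes described above, the following minimal NumPy sketch compares a direct circular cross-correlation with the FFT-based one. The function names, the circular (periodic) boundary handling, the unnormalized sum and the toy signal are assumptions made for illustration; they are not taken from the EnAutoformer implementation.

```python
import numpy as np


def cross_correlation_direct(x, y):
    """Circular cross-correlation R_xy(tau) = sum_t x(t) * y(t - tau), in O(N^2)."""
    n = len(x)
    # np.roll(y, tau)[t] equals y[(t - tau) mod n]
    return np.array([np.sum(x * np.roll(y, tau)) for tau in range(n)])


def cross_correlation_fft(x, y):
    """Same quantity via the convolution theorem, in O(N log N)."""
    fft_x = np.fft.rfft(x)
    fft_y = np.fft.rfft(y)
    # iFFT(FFT_x * conj(FFT_y)); n=len(x) restores the original length
    return np.fft.irfft(fft_x * np.conj(fft_y), n=len(x))


# Toy signal (assumption): x is y delayed by 5 steps
rng = np.random.default_rng(0)
y = rng.standard_normal(64)
x = np.roll(y, 5)

r_direct = cross_correlation_direct(x, y)
r_fft = cross_correlation_fft(x, y)
best_lag = int(np.argmax(r_fft))
print(best_lag, np.allclose(r_direct, r_fft))
```

On this toy input the two routes agree, and the maximum falls at the 5-step delay that was used to construct x.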

PLOS ONE
Auto-correlation describes the correlation between a time series and a copy of itself delayed by the lag. The auto-correlation function is defined and calculated in the same way as the cross-correlation function, with $y(t)$ replaced by $x(t)$. The time delay aggregation (TDA) is an alignment aggregation of the time-shifted series with the top $k$ correlation rankings, which are selected by the cross-correlation function. Following [34], the TDA selects the $k$ lags with the strongest correlation, normalizes their correlation values, shifts the series by each selected lag, and sums the shifted series weighted by the normalized values. Here $\mathrm{Topk}(\cdot)$ is the function used to select the top $k$ time series with the strongest correlation, $\mathrm{SoftMax}(\cdot)$ is the normalized exponential function, and $\mathrm{Roll}(\cdot)$ is a function that shifts the time series according to the offset. For the multi-head mechanism,
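The aggregation steps (Topk, SoftMax, Roll) can be sketched for a single series and a single head as follows. This is a simplified illustration under stated assumptions: the function name, the argument names query, key and value, the roll direction and the toy sinusoid are hypothetical choices, and the sketch does not cover the multi-head case.

```python
import numpy as np


def time_delay_aggregation(query, key, value, top_k):
    """Single-series sketch of top-k time delay aggregation."""
    n = len(query)
    # Correlation between query and key at every lag, computed with the FFT
    corr = np.fft.irfft(np.fft.rfft(query) * np.conj(np.fft.rfft(key)), n=n)
    # Topk: lags with the strongest correlation
    lags = np.argsort(corr)[-top_k:][::-1]
    # SoftMax over the selected correlation values (shifted for numerical stability)
    scores = corr[lags]
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Roll: shift the value series by each selected lag and sum with the weights
    out = np.zeros(n)
    for lag, w in zip(lags, weights):
        out += w * np.roll(value, -int(lag))
    return out, lags, weights


# Toy series (assumption): a sinusoid with a period of 8 steps
t = np.arange(64)
series = np.sin(2 * np.pi * t / 8)
aggregated, lags, weights = time_delay_aggregation(series, series, series, top_k=3)
```

For a strictly periodic input such as this one, the selected lags are multiples of the period, so the aggregated series coincides with the input.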

CSP_ECC
Inspired by CSPNet [36], CSPAttention [35], and the auto-correlation mechanism [34], a cross stage partial module based on enhanced cross-correlation (CSP_ECC) is proposed to capture the inherent features of AQI time series and to reduce the high computational complexity. The structure of CSP_ECC is shown in Fig 2. The CSP_ECC consists of two blocks, one of which is an ECC block and the other a 1 × 1 convolutional layer. CSP_ECC reduces the time complexity by reducing the input dimension [35]. We split the input $X \in \mathbb{R}^{L \times d}$ into two parts, where $L$ is the input length and $d$ is the input dimension: $X_{top} \in \mathbb{R}^{L \times d/2}$ is the input of the ECC block, and $X_{bottom} \in \mathbb{R}^{L \times d/2}$ is the input of the 1 × 1 convolution block. The outputs of the two blocks are concatenated along the feature dimension to form the output of the CSP_ECC.
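The split-and-concatenate data flow can be sketched in NumPy as below. The names csp_block, ecc_fn and conv_weight are hypothetical; the ECC branch is passed in as a placeholder function, and assigning the first half of the channels to the ECC branch is an assumption. The sketch only shows how the halves are routed, not the actual ECC computation.

```python
import numpy as np


def csp_block(x, ecc_fn, conv_weight):
    """Split channels, process the halves separately, then concatenate."""
    _, d = x.shape
    half = d // 2
    x_top, x_bottom = x[:, :half], x[:, half:]   # each of shape (L, d/2)
    top_out = ecc_fn(x_top)                      # ECC branch (placeholder)
    # A 1 x 1 convolution acts on each time step independently,
    # so it reduces to a matrix product over the channel axis
    bottom_out = x_bottom @ conv_weight
    return np.concatenate([top_out, bottom_out], axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((96, 8))                 # L = 96, d = 8 (toy sizes)
w = rng.standard_normal((4, 4))                  # 1 x 1 convolution weights
out = csp_block(x, ecc_fn=lambda z: z, conv_weight=w)
```

Because only half of the channels pass through the correlation branch, the costly part of the block operates on an input of dimension d/2 rather than d.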

Dilated causal convolution (DCC)
A dilated causal convolutional network is a multilayer convolutional neural network that can be expanded in the time domain [37]. It is employed to process long-range dependent sequences by using a non-recursive method. Dilated convolution allows the model to increase the receptive field exponentially with fewer layers while maintaining computational efficiency.
Given an input sequence $X = \{x_1, x_2, \cdots, x_N\}$ and a filter $F = \{f_1, f_2, \cdots, f_k\}$, the dilated causal convolution on element $x_t$ of the input $X$, written $(X *_d F)(t)$, is a weighted sum of $x_t$ and earlier elements spaced $d$ steps apart, where $*_d$ denotes the dilated convolution operator, $d$ is the dilation factor, and $k$ is the filter size. As the depth of the model increases, the dilation factor $d$ increases exponentially, i.e. $d = 2^l$ at layer $l$. A dilated causal convolution with $d = 1, 2, 4$ and filter size $k = 2$ is shown in Fig 3.
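A plain NumPy sketch of a dilated causal convolution, stacked with d = 1, 2, 4 and k = 2, is given below. The zero padding before the start of the series, the 0-based filter indexing (f[0] weights the current element) and the unit filter weights are assumptions chosen for illustration.

```python
import numpy as np


def dilated_causal_conv(x, f, d):
    """y[t] = sum_{i=0}^{k-1} f[i] * x[t - d*i], with x treated as zero before t = 0."""
    n, k = len(x), len(f)
    y = np.zeros(n)
    for t in range(n):
        for i in range(k):
            idx = t - d * i
            if idx >= 0:          # causal: only current and past inputs are used
                y[t] += f[i] * x[idx]
    return y


def receptive_field(k, dilations):
    """Number of input steps seen by one output after stacking the layers."""
    return 1 + (k - 1) * sum(dilations)


# Stack three layers with d = 1, 2, 4 and k = 2
impulse = np.zeros(16)
impulse[0] = 1.0
h = impulse
for d in (1, 2, 4):
    h = dilated_causal_conv(h, np.array([1.0, 1.0]), d)
```

With these three layers a single input step influences eight consecutive outputs, which matches receptive_field(2, [1, 2, 4]) and shows how the receptive field doubles with each added layer.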

Time series decomposition
Time series decomposition transforms a time series into multiple subseries that represent different characteristics. The characteristics, trends and development patterns of variable changes are extracted from the time series to support effective prediction. Among the various decomposition methods, the classical seasonal decomposition applies an additive or multiplicative model to divide a time series into three components: seasonality, trend and noise. In this paper, we perform the time series decomposition using a simplified additive model that decomposes the time series into trend and seasonality. The trend component is obtained by a moving average of the time series.
Removing the calculated trend from the time series produces a new time series called seasonality.
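The simplified additive decomposition can be sketched as follows. The function name, the window length of 24 and the edge handling (repeating the first and last values so that the trend keeps the input length) are assumptions made for this illustration.

```python
import numpy as np


def series_decomp(x, kernel_size):
    """Additive decomposition: trend by moving average, seasonality as the remainder."""
    pad_front = (kernel_size - 1) // 2
    pad_back = kernel_size - 1 - pad_front
    # Repeat the end values so that the moving average keeps the input length
    padded = np.concatenate(
        [np.repeat(x[0], pad_front), x, np.repeat(x[-1], pad_back)]
    )
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend


# Toy series (assumption): linear trend plus a 24-step cycle
t = np.arange(240, dtype=float)
x = 0.1 * t + np.sin(2 * np.pi * t / 24)
seasonal, trend = series_decomp(x, kernel_size=24)
```

When the window length equals the cycle length, the moving average cancels the cycle away from the edges, so the trend component retains only the slow linear drift.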

Our proposed model: EnAutoformer
We propose a novel encoder-decoder model named Enhanced Autoformer (EnAutoformer) for AQI prediction. The structure is represented in Fig 4. The encoder is a stack of identical encoder layers. Each encoder layer contains three CSP_ECC blocks and two FeedForward_1 blocks.
The $l$-th encoder layer can be summarized as $X_{en}^{l} = \mathrm{Encoder}(X_{en}^{l-1})$, where $l \in \{1, 2, \cdots, N\}$ and $X_{en}^{0}$ denotes the initial historical series that has been embedded with temporal information. Within the layer, $X_{en}^{l,i}$, $i \in \{1, 2, 3\}$, represents the output after the $i$-th CSP_ECC in the $l$-th encoder layer. FeedForward_1(·) is a simple feed-forward neural network consisting of an input layer, six hidden layers and an output layer. The FeedForward_1 structure is shown in Fig 5(a).
A decoder block consists of three parts, namely CSP_ECC, SeriesDecomp and FeedForward_2. Suppose the decoder includes $M$ decoder layers. The $l$-th decoder layer can be described as $X_{de}^{l}, T_{de}^{l} = \mathrm{Decoder}(X_{de}^{l-1}, T_{de}^{l-1})$, where $l = 1, 2, \cdots, M$. In Decoder(·), $S_{de}^{l,i}$ and $T_{de}^{l,i}$, $i = 1, 2, 3$, represent the seasonality and trend after the $i$-th time series decomposition block in the $l$-th layer, respectively. FeedForward_2(·) is a simple feed-forward neural network, and its structure is shown in Fig 5(b).
The final prediction $X_{pred}$ is the sum of the refined decomposed sequences. The main steps of the prediction model and its pseudo-code are shown in Algorithm

Datasets
Lanzhou City, the capital of Gansu Province, is an important transportation hub in northwest China and one of the important node cities of the Silk Road Economic Belt. Lanzhou has jurisdiction over five districts and three counties, with a total area of 13,100 square kilometers and a resident population of 4,384,300. Lanzhou is also an important national industrial base for petrochemical, biopharmaceutical and equipment manufacturing. With the continuous and rapid development of the social economy and the rapid increase of energy consumption, Lanzhou is facing growing environmental pressure, and the air pollution problem in the urban area is becoming especially prominent. In this paper, the study is based on hourly datasets collected from four districts (Chengguan, Qilihe, Anning, and Xigu) and one county (Yuzhong). The locations of the air quality monitoring stations in Lanzhou are shown in Fig 6. The data from January 1, 2019, to May 31, 2022, were obtained from https://www.epmap.org and include the O3, PM10, PM2.5, NO2, CO, SO2 and AQI indicators. Table 2 presents the basic statistical characteristics of air quality for the five datasets. Missing data are filled by linear interpolation, and the data are normalized by the Z-score method.
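The two preprocessing steps can be sketched with NumPy as follows. The function names and the toy values are hypothetical, missing readings are assumed to be encoded as NaN, and the statistics are computed over the whole array for brevity; whether the original work computes them on the training split only is not stated.

```python
import numpy as np


def linear_interpolate(values):
    """Fill NaN entries by linear interpolation between known neighbours."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(len(values))
    known = ~np.isnan(values)
    # np.interp holds the nearest known value constant beyond the ends
    return np.interp(idx, idx[known], values[known])


def z_score(values):
    """Standardize to zero mean and unit standard deviation."""
    mean, std = values.mean(), values.std()
    return (values - mean) / std, mean, std


# Toy hourly readings with gaps (assumption)
raw = np.array([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0])
filled = linear_interpolate(raw)
scaled, mean, std = z_score(filled)
```

Keeping the returned mean and standard deviation allows predictions to be mapped back to the original AQI scale.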

Evaluation metrics
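To evaluate the performance of the model, three metrics are used, namely mean squared error (MSE), mean absolute error (MAE) and root mean square error (RMSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$ and $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is the actual value of the AQI, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. The lower the MSE, MAE and RMSE, the better the performance of the model. Three improvement percentage metrics are also used to present the accuracy improvement of the proposed model compared to the baseline model. Each is the relative reduction of the corresponding error, for example $P_{\mathrm{MSE}} = \frac{\mathrm{MSE}_{base} - \mathrm{MSE}_{prop}}{\mathrm{MSE}_{base}} \times 100\%$, and likewise for MAE and RMSE, where the subscript "prop" in Eqs 24-26 refers to the proposed model, and the subscript "base" refers to the baseline model.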
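For concreteness, the error metrics and the improvement percentage can be computed as in the sketch below. The function names and the toy values are hypothetical, and the improvement is written as the relative error reduction with respect to the baseline, which is an assumption about the exact form used.

```python
import numpy as np


def mse(y, y_hat):
    """Mean squared error."""
    return float(np.mean((y - y_hat) ** 2))


def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))


def rmse(y, y_hat):
    """Root mean square error."""
    return float(np.sqrt(mse(y, y_hat)))


def improvement(base, prop):
    """Relative error reduction of the proposed model over a baseline, in percent."""
    return (base - prop) / base * 100.0


# Toy values (assumption)
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
scores = (mse(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

A positive improvement value means that the proposed model has the lower error; a negative value means that the baseline is better.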

Baselines
To evaluate the prediction performance of the proposed model, we use five baselines for comparison: LSTM, Transformer, Informer, Autoformer and TCCT.

Results
The MSE, MAE, RMSE, and the corresponding improvement percentages of the proposed model and baselines are provided in Tables 3 and 4, respectively. The following conclusions can be drawn from these tables: (1) Compared with the LSTM, Informer, Transformer and Autoformer, the TCCT and our EnAutoformer exhibit better prediction performance in all districts. The major difference between these two models and the other four is the use of the CSPDenseNet strategy, which utilizes the cross-stage feature fusion mechanism and integrates the feature maps of each stage of the network. The results show that the models with CSPNet outperform the other models in terms of accuracy.
(2) Compared with the TCCT, the proposed model reduces the MSE, MAE and RMSE on all datasets. Although the reduction in MAE is relatively small for three datasets (0.80%, 2.16% and 3.24% for Chengguan, Xigu and Yuzhong, respectively), the other metrics suggest that our EnAutoformer outperforms the TCCT. We also analyze the performance of all models for long-term prediction. Table 5 shows the performance comparison of different models under different prediction horizons, where 12h, 24h and 36h represent the 12-hour, 24-hour and 36-hour prediction horizons, respectively. It can be observed that the accuracy of short-term AQI prediction is higher than that of long-term prediction: as the prediction horizon increases, the prediction performance of all models gradually decreases. Across the prediction horizons, the evaluation metrics of our proposed model are the smallest among all compared models. These results indicate that, for long-term prediction, the EnAutoformer model yields the most accurate results.

Conclusions
This study aims to enhance the prediction performance of air quality by using deep learning. In this paper, we proposed a novel encoder-decoder model named EnAutoformer to improve AQI prediction. The encoder consists of several identical layers stacked together, each including CSP_ECC and FeedForward blocks. The decoder consists of several decoder layers, each including CSP_ECC, SeriesDecomp, FeedForward and DCC blocks. The CSP_ECC block, which is based on the cross-stage feature fusion mechanism of CSPDenseNet and an enhanced cross-correlation mechanism, is able not only to extract the temporal dependencies in time series but also to improve the computational efficiency. The time series decomposition was employed to further obtain the intrinsic features of time series, including seasonality and trend. The DCC was designed for extracting the long-term dependence of AQI. The effective integration of these techniques enhanced the predictive performance of the proposed model. The MSE, MAE and RMSE metrics were used for evaluating the proposed model and the baselines. Experimental results on real-world datasets show that our EnAutoformer model exhibited the best performance on all datasets and outperformed the existing baselines.
According to the conclusions of this study, future work can concentrate on the following aspects: (1) A shortcoming of the model is that it uses one monitoring location in each district in Lanzhou. External influencing factors, such as meteorological factors, topography and geomorphology, should be added to build prediction models, and datasets containing richer information can be used in the future. (2) Many methods are available to improve the performance and efficiency of deep learning-based predictive models. These methods mainly include data preprocessing [20], deep learning model improvement [32,33,35], improvement of neural networks based on optimization algorithms [38,39], and other hybrid models [25,26,28]. We will continue to experiment with various methods to improve model prediction accuracy and efficiency, such as feature selection, multi-objective optimization techniques and model improvement.